ISSS608 Assignment: VAST Challenge 2021 (Mini-Challenge 2)

Investigating the Mini-Challenge 2 of VAST Challenge 2021

Syed Ahmad Zaki
07-12-2021

Team Member:
Syed Ahmad Zaki, Singapore Management University, Singapore,
Student Team: YES

Tools Used:
Rmarkdown

Approximately how many hours were spent working on this submission in total?
Provide an estimate of the total number of hours worked on this submission by your entire team.

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2021 is complete?
YES

Video
Provide a link to your video. Example: http://www.westbirmingham.ac.uk/uwb-smith-mc2-video.wmv

1. Introduction

1.1 Our Mission (Should We Accept It!)

As a visual analytics expert assisting law enforcement, your mission is to identify which GASTech employees made which purchases and identify suspicious patterns of behavior. You must cope with uncertainties that result from missing, conflicting, and imperfect data to make recommendations for further investigation.

Use visual analytics to analyze the available data and develop responses to the questions below. In addition, prepare a video that shows how you used visual analytics to solve this challenge. Submission instructions are available here. Entry forms are available for download here.

2. Literature Review

A cursory look at the dataset reveals the following data types:

| File | Data Type | Description |
|---|---|---|
| Credit Card.csv | Aspatial | Credit card transactions by timestamp, location and amount |
| Loyalty.csv | Aspatial | Loyalty card transactions by date, location and amount |
| Car Assignment.csv | Aspatial | Car assignment ID with individuals' names and roles |
| MC2.jpg | Aspatial | Abila's map in JPEG format |
| MC2.tif | Geospatial | Abila's map in GeoTIFF format |
| GPS.csv | Geospatial | GPS points (lat-long) by car ID and timestamp |
| Abila | Geospatial | Abila's road network |
| Kronos Island | Geospatial | Polygon showing Kronos Island's administrative boundary |

An in-depth look at the dataset reveals the following fields:

| Field | cc | loyalty | gps | car | mc2 |
|---|---|---|---|---|---|
| File Type | csv | csv | csv | csv | pic |
| Count | 1,490 | 1,392 | 685,169 | 44 | - |
| Date Format | m/d/y | m/d/y | m/d/y | - | - |
| Time Format | h:m | - | h:m:s | - | - |
| Location | Yes | Yes | - | - | Yes |
| Price | Yes | Yes | - | - | - |
| last4ccnum | Yes | - | - | - | - |
| loyaltynum | - | Yes | - | - | - |
| ID | - | - | Yes | Yes | - |
| Latlong | - | - | Yes | - | - |
| Names | - | - | - | Yes | - |
| Employment Details | - | - | - | Yes | - |

Not all files share the same fields. While it is easy to merge the gps and car data on their common car ID, no unique field ties the cc and loyalty data together, so merging them requires some form of fuzzy-joining logic. Separately, we would need to identify the various locations within the gps data, using both the cc data and the mc2 map.

With these datasets in mind, the following considerations need to be addressed:

2.1 Fuzzy Matching

There are a few ways to employ fuzzy matching on our dataset. One is the base R adist function, but it is slow on data of this size. The other is to use a package designed specifically for fuzzy matching. One such package built for speed in matching similar phrases is stringdist, which uses OpenMP-based parallel computing to speed up its comparisons. Its only (minor) limitation is that the columns being compared must sit in the same dataframe. fuzzyjoin, built on top of stringdist, compares columns housed in different datasets, and its output includes a merge of both datasets.
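
A minimal sketch of the fuzzyjoin workflow on toy data (the tibbles `txn` and `ref` and their values are illustrative, not drawn from the challenge data):

```r
library(tibble)
library(fuzzyjoin) # Wraps stringdist for joins across two dataframes

txn <- tibble(desc = c("Katerina's Cafe", "Guy's Gyros"))
ref <- tibble(name = c("Katerinas Cafe", "Guys Gyros"))

# max_dist = 1 lets a single edit (here, a missing apostrophe) still match
matched <- stringdist_inner_join(txn, ref,
                                 by = c("desc" = "name"),
                                 method = "osa",
                                 max_dist = 1)
matched # One row per match, with columns from both tibbles
```

The output carries over columns from both inputs, which is exactly what we need later when tying cc rows to loyalty rows.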

Unfortunately, choosing the fuzzy-matching package is the easy part. The harder part is choosing the appropriate distance metric. Here is the list of metrics currently supported by stringdist:

| Method Name | Description |
|---|---|
| osa | Optimal string alignment (restricted Damerau-Levenshtein distance) |
| lv | Levenshtein distance (as in R's native adist) |
| dl | Full Damerau-Levenshtein distance |
| hamming | Hamming distance (a and b must have the same number of characters) |
| lcs | Longest common substring distance |
| qgram | q-gram distance |
| cosine | Cosine distance between q-gram profiles |
| jaccard | Jaccard distance between q-gram profiles |
| jw | Jaro, or Jaro-Winkler, distance |

Of the above methods, osa, lv and dl seem most apt, since we are dealing with phrases of differing lengths and are chiefly concerned with slight edits, realignments, additions and deletions of characters within these phrases. We will rely on the osa method, since it balances edit sensitivity against speed.
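
The difference between osa and dl only shows up in corner cases, which the textbook pair "ca" vs "abc" illustrates (a sketch calling stringdist directly; the strings are illustrative):

```r
library(stringdist)

# osa (restricted Damerau-Levenshtein) may not edit a substring again after
# transposing it, so it needs three edits here; full dl needs only two
# ("ca" -> "ac" -> "abc")
stringdist("ca", "abc", method = "osa") # 3
stringdist("ca", "abc", method = "lv")  # 3
stringdist("ca", "abc", method = "dl")  # 2
```

For our data such cases are rare, so osa's speed advantage over dl comes at little practical cost.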

2.2 Map Visualisations of GPS Data

ABC

2.3 Non-Map Visualisations of GPS Data

ABC

Candidate visualisations:
- Calendar Heatmap (GPS points per day): http://visualdata.wustl.edu/varepository/VAST%20Challenge%202014/challenges/MC2%20-%20Patterns%20of%20Life%20Analysis/entries/University%20of%20Buenos%20Aires%20-%20Tralice/
- Parallel Coordinates Graph (Location, Date, Timestamp, Price, Name): https://rpubs.com/tskam/PCP
- Gantt Chart (explore Hippokampos)
- RadViz Clock, Average Location: https://www.cs.umd.edu/hcil/varepository/VAST%20Challenge%202014/challenges/MC2%20-%20Patterns%20of%20Life%20Analysis/entries/Central%20South%20University/

3. Data Understanding

We start by loading all the necessary datasets provided in the VAST Challenge 2021 Mini-Challenge 2.

Show code
# Loading the packages needed below, then all datasets and the image
library(raster) # For raster()
library(sf)     # For st_read() and st_as_sf()
cc <- readr::read_csv("data/cc_data.csv") # Add credit card data
loyalty <- readr::read_csv("data/loyalty_data.csv") # Add loyalty data
mc2 <- raster("data/MC2-tourist_modified.tif") # Add tif file as a raster layer
gps <- readr::read_csv("data/gps.csv") # Add gps data
car <- readr::read_csv("data/car-assignments.csv") # Add car assignments
Abila_st <- st_read(dsn = "data", layer = "Abila")
Kronos_sf <- st_as_sf(st_read(dsn = "data", layer = "Kronos_Island"))

As always, we review each dataset in greater detail. This is a necessary step in order to accurately prepare the data for subsequent use.

While reviewing the four csv files, we immediately noticed a few discrepancies:
1. Dates within the timestamp columns are in an M/D/YYYY H:M format
2. Katerina's Cafe contains special (non-ASCII) characters, which may cause downstream problems during our analysis
3. ID and last4ccnum are read as regular doubles instead of nominal (character) types
4. ABC
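
The timestamp fix applied in the cleaning chunk below can be illustrated in isolation with clock's date_time_parse (the sample string is illustrative, not taken from the data):

```r
library(clock)

# Parse an M/D/YYYY H:M string into a proper UTC date-time
ts <- date_time_parse("01/06/2014 07:28",
                      zone = "UTC",
                      format = "%m/%d/%Y %H:%M")
class(ts) # POSIXct, so lubridate-style accessors (day(), wday()) now work
```

The same pattern, with format adjusted per file, handles the cc, loyalty and gps timestamps.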

Show code
#--------------- Cleaning CC data ---------------

cc$timestamp <- date_time_parse(cc$timestamp,
                zone = "UTC",
                format = "%m/%d/%Y  %H:%M") # Readjust CC timestamp
cc[grep("Katerina", cc$location),2] <- "Katerina's Cafe" # Replace unique characters in Katerina's Cafe
cc$last4ccnum <- as_factor(cc$last4ccnum) # Change the column format to nominal format
cc$hour <- as.numeric(format(cc$timestamp,"%H")) # Create a separate column just for hours in the cc data
cc$period <- case_when( # Segment hour into 5 separate periods
  cc$hour >= 21 ~ "Late Evening 9pm to 11.59pm",
  cc$hour >= 18 ~ "Evening 6pm to 8.59pm",
  cc$hour >= 12 ~ "Afternoon 12noon to 5.59pm",
  cc$hour >= 6 ~ "Morning 6am to 11.59am",
  TRUE ~ "Late Night 12mn to 5.59am"
)
cc$dayofmonth <- day(cc$timestamp) # Extract day of month from timestamp in a new column
cc$dayofmonth <- as_factor(cc$dayofmonth) # Change to nominal format
cc$weekday <- wday(cc$timestamp, label = TRUE) # Extract day of week from timestamp in a new column
cc <- tibble::rowid_to_column(cc, "ID") # Create a numeric id column
cc$date <- as.Date(cc$timestamp) # Create a separate column just for dates in the cc data
cc$concat_cc_loyalty <- paste(cc$date,cc$location,cc$price) # Create a separate column of unique values using concatenated values in the cc data
cc$concat_cc_spots <- paste(cc$date,cc$location,cc$hour) # Create a second separate column of unique values using concatenated values in the cc data
cc$ID <- as_factor(cc$ID) # Change the column format to nominal format

#--------------- Cleaning Loyalty data ---------------

loyalty$timestamp <- date_time_parse(loyalty$timestamp,
                zone = "UTC",
                format = "%m/%d/%Y") # Readjust loyalty timestamp
loyalty[grep("Katerina", loyalty$location),2] <- "Katerina's Cafe" # Replace unique characters in Katerina's Cafe
loyalty$dayofmonth <- day(loyalty$timestamp) # Extract day of month from timestamp in a new column
loyalty$dayofmonth <- as_factor(loyalty$dayofmonth) # Change to nominal format
loyalty$weekday <- wday(loyalty$timestamp, label = TRUE) # Extract day of week from timestamp in a new column
loyalty$concat_loyalty_cc <- paste(loyalty$timestamp,loyalty$location,loyalty$price) # Create a separate column of unique values using concatenated values in the loyalty data
loyalty <- tibble::rowid_to_column(loyalty, "ID") # Create a numeric id column
loyalty$ID <- as_factor(loyalty$ID) # Change the column format to nominal format

#--------------- Cleaning Car Assignment data ---------------

car$CarID <- as_factor(car$CarID) # Change the column format to nominal format
car$FullName <- paste(car$FirstName,car$LastName, sep = " ") # Create new column with combined first and last name
car$RoleNName <- paste(car$CarID, car$CurrentEmploymentTitle, car$FullName, sep = " ")

#--------------- Cleaning GPS data ---------------

gps$id <- as_factor(gps$id) # Change the column format to nominal format
gps$Timestamp <- date_time_parse(gps$Timestamp,
                zone = "UTC",
                format = "%m/%d/%Y %H:%M:%S") # Readjust GPS timestamp
gps$date <- as_date(gps$Timestamp) # Create a separate column just for dates in the gps data
gps$hour <- hour(gps$Timestamp) # Create a separate column just for hours in the gps data
gps$period <- case_when( # Segment hour into 5 separate periods
  gps$hour >= 21 ~ "Late Evening 9pm to 11.59pm",
  gps$hour >= 18 ~ "Evening 6pm to 8.59pm",
  gps$hour >= 12 ~ "Afternoon 12noon to 5.59pm",
  gps$hour >= 6 ~ "Morning 6am to 11.59am",
  TRUE ~ "Late Night 12mn to 5.59am"
)
gps$dayofmonth <- day(gps$Timestamp) # Extract day of month from timestamp in a new column
gps$day <- as_factor(get_day(gps$Timestamp))
gps$weekday <- wday(gps$Timestamp, label = TRUE) # Extract day of week from timestamp in a new column

4. Data Preparation

4.1 Combining Both Credit Card and Loyalty Data Using Fuzzy Join (OSA)

We will now attempt to find matching rows between the cc and loyalty data.

Show code
cc_loyalty <- cc %>% # Create a new dataframe showing matches with a max distance difference of 1
  stringdist_inner_join(loyalty, 
                        by = c("concat_cc_loyalty" = "concat_loyalty_cc"),
                        method = "osa",
                        max_dist = 1,
                        distance_col = "distance")

cc_loyalty_1 <- cc_loyalty %>% # Isolate best matching cc and loyalty with more than 2 counts
    group_by(last4ccnum,loyaltynum) %>%
  count() %>%
  filter(n>2) %>%
  ungroup()

cc_loyalty_duplicate_cc <- cc_loyalty_1 %>% # Extract all rows whose last4ccnum appears more than once
  filter(duplicated(last4ccnum) | duplicated(last4ccnum, fromLast = TRUE))

cc_loyalty_duplicate_loyalty <- subset(cc_loyalty_1,loyaltynum == "L6267" | loyaltynum == "L3288") # Extract duplicates in loyalty data

4.2 Combining Both GPS and Car Assignment Data

4.2.1 Initial Step

First, we will merge the GPS data with the car assignments. Next, we will isolate GPS points that have been stationary for at least 10 minutes.

Show code
# GPS Data Manipulation
gps_name <- left_join(gps,car, by = c("id" = "CarID")) # Merge car assignments to gps data
#gps_name$RoleNName <- paste(gps_name$id, gps_name$CurrentEmploymentTitle, #gps_name$FullName, sep = " ")
gps_name$Timestamp <- as.POSIXct(gps_name$Timestamp, format = "%m/%d/%Y  %H:%M:%S", tz = "GMT") # Ensure Timestamp is POSIXct (GMT), parsed as month/day/year
gps_name <- gps_name[with(gps_name,order(id,Timestamp)),] # Sort first by ID in ascending order and then Timestamp by oldest to newest
gps_name <- gps_name %>% # Add running number in the first column
  mutate(No = 1:n()) %>% 
  dplyr::select(No, everything()) 
gps_name <- gps_name %>% # Create additional column indicating time taken from previous timestamp for same ID
    mutate(Delta = Timestamp - lag(Timestamp, default = first(Timestamp)))
gps_name$Delta <- as.numeric(gps_name$Delta) # Convert Delta column to numeric format
gps_name$Delta_Hours <- round(gps_name$Delta / 60 / 60, 1) # Create column to convert Delta seconds into hours with one decimal place

spots <- gps_name %>% # Filtering out gps coordinates where stationary for more than 10 mins
  filter(Delta > 600)
spots$No <- seq_len(nrow(spots)) # Redo running number in the first column (avoids a hard-coded row count)

4.2.2 Identifying Stationary GPS Points

Next, using the map and other data sources, we identify the locations of each of these stationary GPS points. Through a visual inspection of the map, credit card and loyalty data, we found 66 unique locations.

Show code
spots$Location <- 1 # Create a Location column
spots <- spots %>% mutate(
  Location = case_when(
    between(lat, 36.05092013, 36.05102938) & 
      between(long, 24.82586806, 24.82598723)  ~ "Abila Airport", # 35 features
    between(lat, 36.07434876, 36.07443715) & 
      between(long, 24.84592966, 24.84598782)  ~ "Abila Scrapyard", # 4 features
    between(lat, 36.06342076, 36.06349309) & 
      between(long, 24.85096457, 24.85103679)  ~ "Abila Zacharo", # 66 features
    between(lat, 36.07712237, 36.07715385) & 
      between(long, 24.87617634, 24.87621582)  ~ "Ahaggo Museum", # 5 features
    between(lat, 36.07522801, 36.07530344) & 
      between(long, 24.85626503, 24.85634849)  ~ "Albert's Fine Clothing", # 20 features
    between(lat, 36.08172086, 36.08182543) & 
      between(long, 24.85086882, 24.85096705)  ~ "Bean There Done That", # 46 features
    between(lat, 36.05402149, 36.05413903) & 
      between(long, 24.90116515, 24.90128202)  ~ "Brew've Been Served", # 106 features
    between(lat, 36.07332048, 36.07336116) & 
      between(long, 24.86416419, 24.86420583)  ~ "Brewed Awakenings", # 36 features
    between(lat, 36.06582469, 36.065941) & 
      between(long, 24.90097567, 24.90108865)  ~ "20 Building Control Stenig's Home", # 20 features
    between(lat, 36.05851786, 36.05860144) & 
      between(long, 24.8808655, 24.88092654)  ~ "Carlyle Chemical Inc.", # 30 features
    between(lat, 36.07818062, 36.07821857) & 
      between(long, 24.87211555, 24.8721508)  ~ "4 CFO Ingrid's Home", # 27 features
    between(lat, 36.07682044, 36.07685752) & 
      between(long, 24.8658641, 24.86589901)  ~ "10 CIO Ada's Home", # 35 features
    between(lat, 36.0721156, 36.07215701) & 
      between(long, 24.87458425, 24.8746267)  ~ "32 COO Orhan's Home", # 29 features
    between(lat, 36.07062423, 36.07073983) & 
      between(long, 24.89517609, 24.89526281)  ~ "Chostus Hotel", # 11 features
    between(lat, 36.05462322, 36.05469486) & 
      between(long, 24.88977034, 24.88983886)  ~ "Coffee Cameleon", # 29 features
    between(lat, 36.08954231, 36.08962196) & 
      between(long, 24.86066508, 24.8607611)  ~ "Desafio Golf Course", # 10 features
    between(lat, 36.07292088, 36.07301365) & 
      between(long, 24.88396447, 24.88405897)  ~ "26 Drill Site Manager Marin's Home", # 26 features
    between(lat, 36.08442031, 36.08449538) & 
      between(long, 24.86416741, 24.8642387)  ~ "7 Drill Technician Elsa's Home", # 25 features
    between(lat, 36.08424703, 36.08432477) & 
      between(long, 24.8563809, 24.8564637)  ~ "9 Drill Technician Gustav's Home", # 13 features
    between(lat, 36.0726185, 36.07380904) & 
      between(long, 24.87510166, 24.87613744)  ~ "28 Drill Technician Isande's Home", # 26 features
    between(lat, 36.06922564, 36.06931513) & 
      between(long, 24.88416486, 24.88426267)  ~ "27 Drill Technician Kare's Home", # 20 features
    between(lat, 36.08542073, 36.08550845) & 
      between(long, 24.86036422, 24.86045943)  ~ "2 Engineer Lars's Home", # 37 features
    between(lat, 36.08664252, 36.08672442) & 
      between(long, 24.85756416, 24.85766744)  ~ "3 Engineer Felix's Home", # 22 features
    between(lat, 36.07622023, 36.07626546) & 
      between(long, 24.87466429, 24.87471053)  ~ "35 Environmental Safety Advisor Willem's Home", # 33 features
    between(lat, 36.07212045, 36.07213193) & 
      between(long, 24.84132949, 24.84134818)  ~ "Frank's Fuel", # 2 features
    between(lat, 36.05492145, 36.05503511) & 
      between(long, 24.90176782, 24.90188061)  ~ "Frydos Autosupply n' More", # 29 features
    between(lat, 36.04802098, 36.04805422) & 
      between(long, 24.87956497, 24.87957691)  ~ "GasTech", # 738 features
    between(lat, 36.05970763, 36.05981097) & 
      between(long, 24.85797552, 24.8580772)  ~ "Gelatogalore", # 47 features
    between(lat, 36.06034564, 36.06043016) & 
      between(long, 24.85646426, 24.85657454)  ~ "General Grocer", # 12 features
    between(lat, 36.05572125, 36.05584094) & 
      between(long, 24.90246542, 24.90258487)  ~ "Guy's Gyros", # 143 features
    between(lat, 36.06362146, 36.06371539) & 
      between(long, 24.88586605, 24.88595859)  ~ "Hallowed Grounds", # 70 features
    between(lat, 36.07660977, 36.07669909) & 
      between(long, 24.85756408, 24.85764247)  ~ "Hippokampos", # 155 features
    between(lat, 36.08412146, 36.08420924) & 
      between(long, 24.85896842, 24.85905081)  ~ "11 Hydraulic Technician Axel's Home", # 23 features
    between(lat, 36.08782802, 36.08793196) & 
      between(long, 24.85627136, 24.8563725)  ~ "19 Hydraulic Technician Vira's Home", # 24 features
    between(lat, 36.06641679, 36.06650723) & 
      between(long, 24.88256875, 24.88265687)  ~ "1 IT Helpdesk Nils's Home", # 31 features
    between(lat, 36.06729646, 36.06736745) & 
      between(long, 24.87788423, 24.87795559)  ~ "5 IT Technician Isak's Home", # 21 features
    between(lat, 36.06722012, 36.06731624) & 
      between(long, 24.8858687, 24.88596759)  ~ "8 IT Technician Lucas's Home", # 23 features
    between(lat, 36.06749651, 36.0675518) & 
      between(long, 24.87330651, 24.873366)  ~ "Jack's Magical Beans", # 31 features
    between(lat, 36.06582037, 36.06584879) & 
      between(long, 24.85236427, 24.85241027)  ~ "Kalami Kafenion", # 47 features
    between(lat, 36.05442247, 36.05453641) & 
      between(long, 24.89986596, 24.89998054)  ~ "Katerina’s Cafe", # 158 features
    between(lat, 36.05292229, 36.05296701) & 
      between(long, 24.84936915, 24.84941679)  ~ "Kronos Capital", # 6 features
    between(lat, 36.06582196, 36.06587998) & 
      between(long, 24.8497762, 24.84983936)  ~ "Kronos Mart", # 9 features
    between(lat, 36.06523446, 36.06534083) & 
      between(long, 24.83307421, 24.83318494)  ~ "Kronos Pipe and Irrigation", # 7 features
    between(lat, 36.06402993, 36.06410072) & 
      between(long, 24.84137818, 24.84144338)  ~ "Maximum Iron and Steel", # 9 features
    between(lat, 36.05840347, 36.05849041) & 
      between(long, 24.88546548, 24.88553455)  ~ "Nationwide Refinery", # 41 features
    between(lat, 36.05859158, 36.05859887) & 
      between(long, 24.85790261, 24.85799357)  ~ "Octavio's Office Supplies", # 3 features
    between(lat, 36.05192066, 36.05197575) & 
      between(long, 24.87076418, 24.87082137)  ~ "Ouzeri Elian", # 67 features
    between(lat, 36.06764972, 36.06775002) & 
      between(long, 24.90243213, 24.9025445)  ~ "34 Perimeter Control Edvard's Home", # 20 features
    between(lat, 36.06324941, 36.06330782) & 
      between(long, 24.85226894, 24.8523291)  ~ "Roberts and Sons", # 9 features
    between(lat, 36.05942407, 36.05952152) & 
      between(long, 24.89476557, 24.8948649)  ~ "Shared Home A - 6 Linnea 25 Kanon 29 Bertrand", # 72 features
    between(lat, 36.06332304, 36.06343537) & 
      between(long, 24.89607033, 24.89617856)  ~ "Shared Home B - 14 Lidelse 18 Birgitta 21 Hennie", # 60 features
    between(lat, 36.06242283, 36.06253955) & 
      between(long, 24.89877023, 24.89888179)  ~ "Shared Home C - 17 Sven 24 Minke 33 Brand", # 68 features
    between(lat, 36.05842222, 36.05853828) & 
      between(long, 24.90096522, 24.90107874)  ~ "Shared Home D - 22 Adra 23 Varja 30 Felix", # 73 features
    between(lat, 36.0603222, 36.06044736) & 
      between(long, 24.90556693, 24.90569385)  ~ "Shared Home E - 13 Inga 15 Loreto 16 Isia 21 Hennie", # 85 features
    between(lat, 36.05282139, 36.05288367) & 
      between(long, 24.86856868, 24.8686314)  ~ "Shoppers' Delight", # 17 features
    between(lat, 36.06772112, 36.06784956) & 
      between(long, 24.89906521, 24.89917328)  ~ "12 Site Control Hideki's Home", # 21 features
    between(lat, 36.05409586, 36.05420832) & 
      between(long, 24.90806584, 24.90817838)  ~ "Stewart and Sons Fabrication", # 36 features
    between(lat, 36.06774029, 36.06776587) & 
      between(long, 24.87148791, 24.87150031)  ~ "U-Pump", # 4 features
    between(lat, 36.05012433, 36.05021624) & 
      between(long, 24.9003978, 24.90047475)  ~ "Anonymous Site 1", # 6 features
    between(lat, 36.06314781, 36.06324321) & 
      between(long, 24.90010823, 24.90018668)  ~ "Anonymous Site 2", # 7 features
    between(lat, 36.05893131, 36.05900826) & 
      between(long, 24.89277554, 24.89284962)  ~ "Anonymous Site 3", # 7 features
    between(lat, 36.08061881, 36.08067087) & 
      between(long, 24.84681621, 24.84688282)  ~ "Anonymous Site 4", # 7 features
    between(lat, 36.06944928, 36.0695319) & 
      between(long, 24.84147082, 24.84157048)  ~ "Anonymous Site 5", # 8 features
    between(lat, 36.05149231, 36.05253234) & 
      between(long, 24.87495168, 24.87611086)  ~ "Anonymous Site 6", # 13 features
    between(lat, 36.05543848, 36.05657576) & 
      between(long, 24.86618187, 24.86735)  ~ "Anonymous Site 7", # 7 features 
    between(lat, 36.07099038, 36.07200089) & 
      between(long, 24.86869468, 24.86985682)  ~ "Anonymous Site 8", # 10 features 
    ))

spots$concat_spots_cc <- paste(spots$date,spots$Location,spots$hour) # Create a separate column of unique values using concatenated values in the distilled GPS data
spots_median <- spots %>% # Extract median latlong of locations
  group_by(Location) %>%
    summarise(lat.median = median(lat), long.median = median(long)) %>%
  filter(!is.na(Location)) %>% # Exclude remaining few unmatched locations
  ungroup()

spots_median <- spots_median %>% # Add additional column to classify locations into major buckets
  mutate(Location.Type = case_when(
    Location %in% c("Anonymous Site 1",
                    "Anonymous Site 2",
                    "Anonymous Site 3",
                    "Anonymous Site 4",
                    "Anonymous Site 5",
                    "Anonymous Site 6",
                    "Anonymous Site 7",
                    "Anonymous Site 8") ~ "Unknown",
    Location %in% c("Bean There Done That",
                    "Brew've Been Served",
                    "Brewed Awakenings",
                    "Coffee Cameleon",
                    "Jack's Magical Beans",
                    "Hallowed Grounds") ~ "Coffee Cafe",
    Location %in% c("Abila Zacharo",
                    "Gelatogalore",
                    "Guy's Gyros",
                    "Hippokampos",
                    "Kalami Kafenion",
                    "Katerina’s Cafe", # Spelled exactly as in the spots data, so the bucket match succeeds
                    "Ouzeri Elian") ~ "Food Joints",
    Location %in% c("GasTech") ~ "HQ",
    Location %in% c("1 IT Helpdesk Nils's Home",
                    "10 CIO Ada's Home",
                    "11 Hydraulic Technician Axel's Home",
                    "12 Site Control Hideki's Home",
                    "19 Hydraulic Technician Vira's Home",
                    "2 Engineer Lars's Home",
                    "20 Building Control Stenig's Home",
                    "26 Drill Site Manager Marin's Home",
                    "27 Drill Technician Kare's Home",
                    "28 Drill Technician Isande's Home",
                    "3 Engineer Felix's Home",
                    "32 COO Orhan's Home",
                    "34 Perimeter Control Edvard's Home",
                    "35 Environmental Safety Advisor Willem's Home",
                    "4 CFO Ingrid's Home",
                    "5 IT Technician Isak's Home",
                    "7 Drill Technician Elsa's Home",
                    "8 IT Technician Lucas's Home",
                    "9 Drill Technician Gustav's Home",
                    "Shared Home A - 6 Linnea 25 Kanon 29 Bertrand",
                    "Shared Home B - 14 Lidelse 18 Birgitta 21 Hennie",
                    "Shared Home C - 17 Sven 24 Minke 33 Brand",
                    "Shared Home D - 22 Adra 23 Varja 30 Felix",
                    "Shared Home E - 13 Inga 15 Loreto 16 Isia 21 Hennie") ~ "Residential",
    Location %in% c("Abila Scrapyard",
                    "Carlyle Chemical Inc.",
                    "Kronos Pipe and Irrigation",
                    "Maximum Iron and Steel",
                    "Nationwide Refinery",
                    "Stewart and Sons Fabrication") ~ "Industrial",    
    Location %in% c("Ahaggo Museum",
                    "Albert's Fine Clothing",
                    "Kronos Mart",
                    "Octavio's Office Supplies",
                    "Shoppers' Delight",
                    "General Grocer",
                    "Roberts and Sons") ~ "Leisure & Shopping",
    Location %in% c("Abila Airport",
                    "Chostus Hotel",
                    "Desafio Golf Course",
                    "Kronos Capital") ~ "Complex",
    Location %in% c("Frank's Fuel",
                    "Frydos Autosupply n' More",
                    "U-Pump") ~ "Transport",
    ))

#sum(is.na(spots$Location))
#length(grep("Frydos Autosupply n' More", spots$Location))
#write_csv(spots,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_latest.csv")
#write_csv(spots_median,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_median.csv")
#write_csv(cc,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_date.csv")

4.3 Combining Both Credit Card and Spots Data Using Fuzzy Join (OSA)

We will now attempt to find matching rows between the cc and spots data.

ABC Need to remove five duplicates

Show code
cc_spots <- cc %>% # Create a new dataframe that shows matches with a max distance difference of 1
  stringdist_inner_join(spots, 
                        by = c("concat_cc_spots" = "concat_spots_cc"),
                        method = "osa",
                        max_dist = 1,
                        distance_col = "distance")

cc_spots_1 <- cc_spots %>% # Count matches for each driver (RoleNName) and credit card pair
  filter(!is.na(FullName)) %>% # Remove unknown drivers
  group_by(RoleNName,last4ccnum) %>%
  count() %>%
  arrange(RoleNName,-n) %>% # Arrange the highest to lowest count in each group
  ungroup()

cc_spots_1 <- cc_spots_1[!duplicated(cc_spots_1$RoleNName),] # Isolating 1 cc to 1 driver

cc_spots_1$last4ccnum[duplicated(cc_spots_1$last4ccnum)] # Need to remove five duplicates
[1] 6901 1310 7253 1415
55 Levels: 1286 1310 1321 1415 1874 1877 2142 2276 2418 2463 ... 9735
Show code
#write_csv(cc_spots,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_spots.csv")

4.4 Creating a Custom Map of Abila Using tmap

Show code
Abila_st_union <- st_union(Abila_st) # Dissolve Abila road network
Abila_st_proj <- st_transform(Abila_st_union, crs = 3857)
Abila_st_buffer <- st_buffer(Abila_st_proj, dist = 25, nQuadSegs = 5) # Create a 25m buffer around the dissolved Abila road network

gps_sf <- st_as_sf(gps, coords = c("long", "lat"), crs = 4326) # Changing into a shapefile
spots_median_sf <- st_as_sf(spots_median, coords = c("long.median", "lat.median"), crs = 4326) # Changing into a shapefile
gps_path <- gps_sf %>% # Creating a movement path
  group_by(id) %>%
  summarize(m = mean(Timestamp), 
            do_union=FALSE) %>%
  left_join(dplyr::select(car,CarID,RoleNName), by = c("id" = "CarID")) %>% #Add in RoleNName column
  st_cast("LINESTRING")  

# Create blue polygon as background to mimic sea
long.sea <- c(24.91075,24.91075,24.8232,24.8232,24.91075)
lat.sea <- c(36.09543,36.0445,36.0445,36.09543,36.09543)
sea <-data.frame(long.sea, lat.sea)
rm(long.sea)
rm(lat.sea)
sea_sf <- st_as_sf(sea, coords = c("long.sea", "lat.sea"))
st_crs(sea_sf) <- 4326
sea_poly<- st_sf(
  st_cast(
    st_combine(sea_sf$geometry),"POLYGON"
  )
)

# Clip a smaller Kronos island around Abila
Kronos_sf_small <- st_crop(Kronos_sf, c(xmin = 24.8232, xmax = 24.91075, ymin = 36.0445, ymax = 36.09543))

tmap_mode("view")
#tm_shape(mc2) +
#  tm_rgb(mc2, r = 1,g = 2,b = 3,
#       alpha = NA,
#       saturation = 1,
#       interpolate = TRUE,
#       max.value = 255) +
custom_tmap <- tm_shape(sea_poly) +
  tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
  tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
  tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==2)) +
  tm_lines(id = "RoleNName") +
tm_shape(spots_median_sf) +
  tm_dots(col = "Location.Type",
          id = "Location", # Bold in group
          popup.vars = c("Location Type:" = "Location.Type"),
          size = 0.2)

custom_tmap
Show code
#class(sea_poly)
#st_cast(sea_sf,"POLYGON")
#class(sea_sf)
#sea_sf <- st_multilinestring(sea_sf)
#st_cast(sea_sf, "POLYGON")
#sea <- matrix(c(24.90976,36.09543,
#                24.90976,36.04499,
#                24.82419,36.04499,
#                24.82419,36.09543,
#                24.90976,36.09543), ncol=2, byrow = TRUE)
#sea_poly <- st_polygon(list(sea))
#sea_poly <- st_set_crs(sea_poly, 4326)
#st_crs(sea_poly)
#class(sea_poly)
#gps_path_selected <- gps_path %>%
#  filter(id==3)
#gps_path_selected <- gps_path %>%
#  filter(id==1)
#st_crs(Abila_st_union)
#st_crs(Abila_st_buffer)
#print(Abila_st_buffer)
#class(Abila_st)
#Abila_st_proj <- st_transform(Abila_st, 32632) # UTM Zone 32N
#Abila_st_buffer <- st_buffer(Abila_st_proj, dist = 100)#write_sf(Abila_st_union,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\Abila_st_union.shp")
#write_sf(Abila_st_buffer,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\Abila_st_buffer.shp")
#write_sf(sea_poly,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\sea_poly.shp")

5. Data Exploration and Analysis

Here we will answer the VAST Challenge questions.

5.1 Question 1 And Its Answers

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.

Food and beverage places appear to be the most popular locations.

Show code
cc_calendar <- cc %>% 
  count(dayofmonth, location)
cc_calendar$dayofmonth <- as_factor(cc_calendar$dayofmonth)

Q5.1.1 <- ggplot(complete(cc_calendar, dayofmonth, location), aes(x = dayofmonth, y = location)) + 
  geom_tile(aes(fill = n), color = "white", size = 0.1) +
  scale_fill_gradient(low = "light grey", high = "black", na.value = "light grey") +
#  scale_x_date(date_labels = "%a \n %d %b", 
#               date_breaks = "1 day") +
#  scale_x_discrete("dayofmonth") +
#                   ,
#                     n.breaks = 14) +
#  scale_x_discrete(expand = expansion(add = 1.6)) +
  scale_y_discrete(expand = expansion(add = 1.6),
                   limits=rev) +
  labs(title = "Calendar Heatmap of Location Visit Frequency By Date",
       x = "Day of Month",
       fill = "Frequency Of Visit") +
  theme_bw() +
  theme(axis.ticks = element_blank(),
        plot.title = element_text(hjust=0.5),
        panel.border = element_blank(),
        panel.spacing = unit(0.1, "cm"),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "bottom")

Q5.1.1

5.1.2 Kronos Mart’s Txns In The Credit Card Data Lag The Loyalty Data By One Day Overall

Show code
Q5.1.2_cc <- cc %>%
  filter(location == "Kronos Mart") %>%
  dplyr::select(dayofmonth, price, location) %>%
  group_by(dayofmonth) %>%
  summarise(cc_data = sum(price))

Q5.1.2_loyalty <- loyalty %>%
  filter(location == "Kronos Mart") %>%
  dplyr::select(dayofmonth, price, location) %>%
  group_by(dayofmonth) %>%
  summarise(loyalty_data = sum(price))

Q5.1.2_combined <- data.frame(dayofmonth = c(6:19))
Q5.1.2_combined$dayofmonth <- as_factor(Q5.1.2_combined$dayofmonth)
Q5.1.2_combined <- Q5.1.2_combined %>%
  left_join(Q5.1.2_cc, by = "dayofmonth") %>%
  left_join(Q5.1.2_loyalty, by = "dayofmonth")

Q5.1.2_combined$cc_data[is.na(Q5.1.2_combined$cc_data)] <- 0
Q5.1.2_combined$loyalty_data[is.na(Q5.1.2_combined$loyalty_data)] <- 0

Q5.1.2_combined <- melt(Q5.1.2_combined, id.vars = "dayofmonth", variable.name = "source")

Q5.1.2 <- ggplot(Q5.1.2_combined, aes(dayofmonth, value, group = source)) +
  geom_area(aes(colour = source, fill = source),
            size = 1) +
  geom_point() +
  geom_text(data=subset(Q5.1.2_combined, value != 0),
            aes(label = value,
            group = source),
            vjust = -1,
            size = 3) +
  facet_grid(source~.) +
  ylim(0,500) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.position = "none")

Q5.1.2
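As a numerical cross-check on such a lag, one series can be shifted against the other and the agreement compared at each shift. A minimal base-R sketch on toy daily totals (the numbers below are illustrative, not the actual Kronos Mart figures):

```r
# Toy daily totals: the cc series repeats the loyalty series one day later
loyalty_daily <- c(120, 80, 200, 150, 90, 170, 60)
cc_daily      <- c(0, 120, 80, 200, 150, 90, 170)

# Correlation between x and y shifted back by k days
lag_agreement <- function(x, y, k) {
  n <- length(x)
  cor(x[1:(n - k)], y[(1 + k):n])
}

sapply(0:2, function(k) lag_agreement(loyalty_daily, cc_daily, k))
# Agreement peaks at k = 1, consistent with a one-day lag
```

If the real credit card totals lag the loyalty totals by one day, the same calculation on the actual daily sums should peak at a shift of one day.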

5.1.3 What type of location is Hippokampos? It seems to be near GASTech

5.1.4 Show spending outliers

5.1.5 Some txns at coffee chains occur only at selected hours in the credit card data

Show code
Q5.1.5_cc <- cc %>%
  left_join(dplyr::select(spots_median,Location, Location.Type), by = c("location" = "Location")) %>%
  filter(Location.Type == "Coffee Cafe") %>%
  dplyr::select(location, hour, price) %>%
  group_by(location, hour) %>%
  summarise(coffee_money = sum(price), .groups = "drop") %>%
  dcast(hour ~ location, value.var = "coffee_money")
Q5.1.5_cc$hour <- as_factor(Q5.1.5_cc$hour)

Q5.1.5_combined <- data.frame(hour = c(1:24))
Q5.1.5_combined$hour <- as_factor(Q5.1.5_combined$hour)
Q5.1.5_combined <- Q5.1.5_combined %>%
  left_join(Q5.1.5_cc, by = "hour")

Q5.1.5_combined <- melt(Q5.1.5_combined, id.vars = "hour", variable.name = "coffee_place")

Q5.1.5 <- ggplot(Q5.1.5_combined, aes(hour, value, fill = coffee_place)) +
  geom_bar(stat = "identity") +
  coord_polar(theta = "x") +
  ggtitle("Txns of Coffee Chains By Hour") +
  xlab("")+ylab("") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.ticks = element_blank(), 
        axis.text.y = element_blank(), 
        panel.background = element_blank(), 
        panel.grid.major.x = element_line(colour="grey"),
        axis.text.x = element_text(size = 15), 
        legend.title=element_blank())

Q5.1.5


5.1.7 Cases of 1 Credit Card Tied To 2 Loyalty Cards And Vice Versa

Show code
Q5.1.7 <- ggplot(complete(cc_loyalty_1, last4ccnum, loyaltynum), aes(x = loyaltynum, y = last4ccnum)) + 
  geom_tile(aes(fill = n), color = "white", size = 0.1) +
  scale_fill_gradient(low = "light grey", high = "black", na.value = "light grey") +
  scale_y_discrete(limits=rev) +
  labs(title = "Heatmap of Credit Card And Matching Loyalty Numbers",
       x = "Loyalty Numbers",
       fill = "Matching Instance") +
  theme_bw() +
  theme(axis.ticks = element_blank(),
        plot.title = element_text(hjust=0.5),
        panel.border = element_blank(),
        panel.spacing = unit(0.1, "cm"),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
#        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "bottom"
)

Q5.1.7
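One-to-many pairings can also be listed directly rather than read off the heatmap. A base-R sketch on a toy pairs table (standing in for the matched cc_loyalty_1 data; card numbers are made up):

```r
# Toy matched pairs: card 1234 maps to two loyalty cards, L300 to two credit cards
pairs <- data.frame(
  last4ccnum = c("1234", "1234", "5678", "9999"),
  loyaltynum = c("L100", "L200", "L300", "L300")
)

# Credit cards tied to more than one loyalty card
cc_multi <- names(which(tapply(pairs$loyaltynum, pairs$last4ccnum,
                               function(x) length(unique(x))) > 1))

# Loyalty cards tied to more than one credit card
loyalty_multi <- names(which(tapply(pairs$last4ccnum, pairs$loyaltynum,
                                    function(x) length(unique(x))) > 1))

cc_multi      # "1234"
loyalty_multi # "L300"
```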


5.2 Question 2 And Its Answers

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

5.2.1a No vehicle movements on weekends before midday?

Show code
Q5.2.1a_weekday <- data.frame("weekday" = unique(cc[c('weekday')])) %>%
  slice(rep(1:n(), each = 5))
Q5.2.1a_period <- data.frame("period" = unique(cc[c('period')]))
Q5.2.1a_period$period <- factor(Q5.2.1a_period$period, 
                                levels = c("Morning 6am to 11.59am", 
                                           "Afternoon 12noon to 5.59pm", 
                                           "Evening 6pm to 8.59pm",
                                           "Late Evening 9pm to 11.59pm",
                                           "Late Night 12mn to 5.59am"))
Q5.2.1a_period <- as.data.frame(lapply(Q5.2.1a_period,rep,7))
Q5.2.1a_combined <- cbind(Q5.2.1a_weekday,Q5.2.1a_period)

Q5.2.1a_cc <- cc %>%
  group_by(weekday,period) %>%
  tally()
  
Q5.2.1a_combined <- Q5.2.1a_combined %>%
  left_join(Q5.2.1a_cc, by = c("weekday"="weekday","period"="period"))
Q5.2.1a_combined$id <- seq(1, nrow(Q5.2.1a_combined))
Q5.2.1a_combined[36:63,] <- NA
# Assign spacer ids in the id column (column 4): x.1 to x.4 between each weekday group of 5
Q5.2.1a_combined$id[36:63] <- rep(seq(5, 35, by = 5), each = 4) + rep(1:4, times = 7) / 10

Q5.2.1a_combined <- Q5.2.1a_combined %>%
  arrange(id)
Q5.2.1a_combined$id <- seq(1, nrow(Q5.2.1a_combined))
Q5.2.1a_combined$period <- factor(Q5.2.1a_combined$period, 
                                levels = c("Morning 6am to 11.59am", 
                                           "Afternoon 12noon to 5.59pm", 
                                           "Evening 6pm to 8.59pm",
                                           "Late Evening 9pm to 11.59pm",
                                           "Late Night 12mn to 5.59am"))

Q5.2.1a_label <- Q5.2.1a_combined
Q5.2.1a_number_of_bar <- nrow(Q5.2.1a_label)
Q5.2.1a_angle <- 90 - 360 * (Q5.2.1a_label$id-0.5) /Q5.2.1a_number_of_bar
Q5.2.1a_label$hjust <- ifelse(Q5.2.1a_angle < -90, 1, 0)
Q5.2.1a_label$angle <- ifelse(Q5.2.1a_angle < -90, Q5.2.1a_angle+180, Q5.2.1a_angle)

Q5.2.1a_base <- Q5.2.1a_combined %>% 
  group_by(weekday) %>% 
  summarize(start=min(id), end=max(id) - 4) %>% 
  rowwise() %>% 
  mutate(title=mean(c(start, end))) 

Q5.2.1a_grid <- Q5.2.1a_base
Q5.2.1a_grid$end <- Q5.2.1a_grid$end[ c( nrow(Q5.2.1a_grid), 1:nrow(Q5.2.1a_grid)-1)] + 1
Q5.2.1a_grid$start <- Q5.2.1a_grid$start - 1
Q5.2.1a_grid <- Q5.2.1a_grid[-1,]


Q5.2.1a <- ggplot(Q5.2.1a_combined, aes(x=as.factor(id), y=n, fill=period)) +
  geom_bar(aes(x=as.factor(id), y=n, fill=period), stat="identity", alpha=0.5) +
  
  geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 120, xend = start, yend = 120), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 90, xend = start, yend = 90), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 60, xend = start, yend = 60), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 30, xend = start, yend = 30), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  
  annotate("text", x = rep(max(Q5.2.1a_combined$id),4), y = c(30, 60, 90, 120), label = c("30", "60", "90", "120") , color="grey", size=3 , angle=0, fontface="bold", hjust=1) +
  
  geom_bar(aes(x=as.factor(id), y=n, fill=period), stat="identity", alpha=0.5) +
  ylim(-100,150) +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        plot.margin = unit(rep(-1,4), "cm")) +
  coord_polar() + 
  geom_text(data=Q5.2.1a_label, aes(x=id, y=n+10, label=n, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=3, angle= Q5.2.1a_label$angle, inherit.aes = FALSE ) +
  
  geom_segment(data=Q5.2.1a_base, aes(x = start, y = -5, xend = end, yend = -5), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE )  +
  
  geom_text(data=Q5.2.1a_base, aes(x=title, y=-18, label=weekday), color="black", fontface="bold",alpha=0.6, size=3, inherit.aes = FALSE )
  
Q5.2.1a 

5.2.2 Large Distance Gap Between Consecutive GPS Records?
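One way to probe this is to flag consecutive fixes whose step size is far above the typical step for that car. A sketch on a toy single-car track (coordinates are made up; the real check would run per car id, ordered by timestamp):

```r
# Toy track with one implausible jump between the 3rd and 4th fix
track <- data.frame(
  lat  = c(36.050, 36.051, 36.052, 36.090, 36.091),
  long = c(24.860, 24.861, 24.861, 24.900, 24.901)
)

# Step sizes in degrees; adequate for relative comparison at city scale
step <- sqrt(diff(track$lat)^2 + diff(track$long)^2)

# Flag fixes followed by a jump more than 10x the median step
gap_after <- which(step > 10 * median(step))
gap_after # 3: the gap sits between the 3rd and 4th records
```

A large flagged gap would suggest the tracker was switched off and the car reappeared elsewhere.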

5.2.3 Weird Off-Road Driving by Isande Borrasca

Show code
tmap_mode("view")

Q5.2.3 <- tm_shape(sea_poly) +
  tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
  tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
  tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==28)) + # Extract Isande's path
  tm_lines(col = "black",
           lty = 1,
           id = "RoleNName") +
tm_shape(spots_median_sf) +
  tm_dots(col = "Location.Type",
          id = "Location", # Bold in group
          popup.vars = c("Location Type:" = "Location.Type"),
          size = 0.2)

Q5.2.3

5.2.4 Difference in locations between cc and gps data (Kronos Mart and others)?

5.2.5 Difference in cc and gps logs for President?

5.2.6 Axel Calzas does not go to work?


5.3 Question 3 And Its Answers

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.


5.4 Question 4 And Its Answers

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

5.4.1 Elsa (ID: 7, black line) and Brand (ID: 33, blue line) Are Seeing Each Other

Show code
tmap_mode("view")

Q5.4.1 <- tm_shape(sea_poly) +
  tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
  tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
  tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==7)) + # Extract Elsa's path
  tm_lines(col = "black",
           lty = 1,
           id = "RoleNName") +
tm_shape(gps_path %>% filter(id==33)) + # Extract Brand's path
  tm_lines(col = "blue",
           lty = 1,
           id = "RoleNName") +  
tm_shape(spots_median_sf) +
  tm_dots(col = "Location.Type",
          id = "Location", # Bold in group
          popup.vars = c("Location Type:" = "Location.Type"),
          size = 0.2)

Q5.4.1

5.4.2 Hennie Osvaldo (ID: 21) Has Two Homes, Both Shared With A Few Others

Show code
tmap_mode("view")

Q5.4.2 <- tm_shape(sea_poly) +
  tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
  tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
  tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==21)) + # Extract Hennie's path
  tm_lines(col = "black",
           lty = 1,
           id = "RoleNName") +  
tm_shape(spots_median_sf %>%
           filter(Location == "Shared Home B - 14 Lidelse 18 Birgitta 21 Hennie" | Location == "Shared Home E - 13 Inga 15 Loreto 16 Isia 21 Hennie")) +
  tm_dots(col = "green",
          size = 0.2)

Q5.4.2


5.5 Question 5 And Its Answers

Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why. Please limit your response to 10 images and 500 words.

Suspicious activities can take the following forms:
1) The presence of two or more individuals at the same location during the same hour for extended periods
2) Individuals frequenting unusual places at abnormal hours
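The first form (co-presence) can be sketched by binning fixes into coarse location cells per hour and self-joining. Toy records below (ids and coordinates are illustrative; the real analysis would use the gps data):

```r
# Toy GPS-like observations: ids 7 and 33 share a cell in hour 21
obs <- data.frame(
  id   = c(7, 33, 7, 33, 15),
  hour = c(21, 21, 22, 23, 21),
  lat  = c(36.0701, 36.0702, 36.0701, 36.0900, 36.0500),
  long = c(24.8601, 24.8602, 24.8601, 24.8700, 24.8800)
)

# Bin each fix into a rounded lat/long cell within its hour
obs$cell <- paste(round(obs$lat, 3), round(obs$long, 3), obs$hour)

# Self-join on cell; keep each pair once and drop self-matches
copresent <- subset(merge(obs, obs, by = "cell"), id.x < id.y)
copresent[, c("id.x", "id.y", "cell")]
```

Extended co-presence would then show up as the same pair of ids recurring across many consecutive hourly cells.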

5.5.1 Presence of Anonymous Locations (shown as black dots on map)

Show code
tmap_mode("view")

Q5.5.1 <- tm_shape(mc2) +
tm_rgb(r = 1, g = 2, b = 3, # raster layer comes from tm_shape(mc2) above
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
tm_shape(spots_median_sf %>%
           filter(Location.Type != "Unknown")) +
  tm_dots(col = "Location.Type",
          id = "Location", # Bold in group
          popup.vars = c("Location Type:" = "Location.Type"),
          size = 0.2) +
tm_shape(spots_median_sf %>%
           filter(Location.Type == "Unknown")) +
  tm_dots(col = "black",
          id = "Location", # Bold in group
          popup.vars = c("Location Type:" = "Location.Type"),
          size = 0.2)

Q5.5.1

5.5.2 Suspicious Monitoring of the Key Officials’ Homes (by Bodrogi, Vann, Osvaldo and Mies)

Showcasing only residential points, Bodrogi (ID: 15, black line), Vann (ID: 16, blue line), Osvaldo (ID: 21, purple line) and Mies (ID: 24, red line) were seen patrolling key executives’ houses located near the centre of the area. (Hover over the lines and points to see the ID and owner of each residence.)

Show code
tmap_mode("view")

Q5.5.2 <- tm_shape(sea_poly) +
  tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
  tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
  tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==15)) + # Extract Bodrogi's path
  tm_lines(col = "black",
           lty = 1,
           id = "RoleNName") +
tm_shape(gps_path %>% filter(id==16)) + # Extract Vann's path
  tm_lines(col = "blue",
           lty = 1,
           id = "RoleNName") +  
tm_shape(gps_path %>% filter(id==21)) + # Extract Osvaldo's path
  tm_lines(col = "purple",
           lty = 1,
           id = "RoleNName") +
tm_shape(gps_path %>% filter(id==24)) + # Extract Mies's path
  tm_lines(col = "red",
           lty = 1,
           id = "RoleNName") +  
tm_shape(spots_median_sf %>%
           filter(Location.Type == "Residential")) +
  tm_dots(col = "green",
          size = 0.2)

Q5.5.2

5.5.3 Suspicious Gas Station Activities

5.5.4 Suspicious Activity By Alcazar

5.5.5 Party on Jan 10 evening

5.5.6 Kronos Mart Midnight

5.5.7 Visiting Kronos Capitol

5.5.8 Bertrand Ovan driving in a circle near midnight on 11th Jan 2014

5.5.9 President only started appearing late in the period

5.5.10 Maybe calculate speed and distance
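A possible starting point: per-segment distance via the haversine formula, divided by elapsed time. A sketch on toy fixes (column names and values are illustrative, not taken from the actual gps data):

```r
# Great-circle distance in km between two lat/long points (haversine)
haversine_km <- function(lat1, long1, lat2, long2, r = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * r * asin(sqrt(a))
}

# Toy track for one car, ordered by time
track <- data.frame(
  lat  = c(36.0500, 36.0520, 36.0560),
  long = c(24.8600, 24.8600, 24.8650),
  secs = c(0, 60, 180) # seconds since the first fix
)

n <- nrow(track)
dist_km   <- haversine_km(track$lat[-n], track$long[-n],
                          track$lat[-1], track$long[-1])
speed_kmh <- dist_km / (diff(track$secs) / 3600)
```

Implausibly high speeds would then point at GPS glitches, while long near-zero stretches would indicate idling or parked cars.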


5.6 Question 6 And Its Answers

If you solved this mini-challenge in 2014, how did you approach it differently this year?

5.6.1 Question Not Applicable

We did not attempt this mini-challenge in 2014.


Testing the use of a static image as a background and overlaying the GPS data on it using ggplot

Show code
# Mapping map and gps together specifically for CarID #1
#mapping <- ggplot(gps_name %>%
#                    filter(id == "1"),
#                  aes(long, lat)) +
#  annotation_custom(rasterGrob(mc2,
#    width = unit(1, "npc"),
#    height = unit(1,"npc")),
#  xmin = 24.8244, xmax = 24.9096, ymin = 36.0453, ymax = 36.0952) + #Original searching
#  xmin = 24.82419, xmax = 24.90976, ymin = 36.04499, ymax = 36.09543) + #Prof's raster extraction
#  geom_point(size = 0.1) + 
#  coord_fixed(xlim = c(24.8244, 24.9096), ylim = c(36.0453, 36.0952)) + # Fixing the scales regardless of filtering of points
#  theme_bw() + theme(panel.border = element_blank(), # Remove background and grids and reformat scales and axis
#                     panel.grid.major = element_blank(), 
#                     panel.grid.minor = element_blank(), 
#                     axis.line = element_line(colour = "black"))  
#  + transition_time(Timestamp) +
#  labs(title = "Date:{frame_time}")

#mapping

Past work:

We will use fuzzy string matching based on the Levenshtein distance, which is available natively in R via adist() from the base utils package.

Show code
dist.concat <- adist(cc$concat,loyalty$concat, partial = TRUE, ignore.case = TRUE) #Creates a matrix with the Standard Levenshtein distance between both newly created concat columns
min.concat <- apply(dist.concat, 1, min) #Extract pairs with minimum distance

match.s1.s2 <- NULL
for (i in 1:nrow(dist.concat))
{
  s2.i <- match(min.concat[i], dist.concat[i,])
  s1.i <- i
  match.s1.s2 <- rbind(data.frame(loyalty.i=s2.i,
                                    cc.i=s1.i,
                                    loyalty_concat=loyalty[s2.i,]$concat,
                                    cc_concat=cc[s1.i,]$concat,
                                    adist=min.concat[i]),match.s1.s2)
  }
    
cc_loyalty <- match.s1.s2 %>%
  left_join(dplyr::select(cc, last4ccnum, ID), by = c("cc.i" = "ID")) %>% #Add in CC num column
  left_join(dplyr::select(loyalty, loyaltynum, ID), by = c("loyalty.i" = "ID")) #Add in loyalty card num column
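A minimal illustration of what adist() returns (toy strings, not the real concat columns):

```r
# Levenshtein distance: single-character inserts, deletes and substitutions
adist("kitten", "sitting")                  # 3: k->s, e->i, insert g
adist("Katerina's Cafe", "Katarina's Cafe") # 1: e->a

# With vectors, adist() returns the full pairwise distance matrix,
# which is the form the matching loop works with
adist(c("Hippokampos", "Katerina's Cafe"), "Katarina's Cafe")
```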

4.1.2 Extracting Best Matching Pairs

Let’s now extract the matching credit card-loyalty card pairs, keeping pairs where at least 80% of their matched records are perfect or near-perfect (Levenshtein distance 0 or 1).

Show code
cc_loyalty_unique <- dcast(cc_loyalty, last4ccnum + loyaltynum ~ adist) #Long to wide by transposing adist
cc_loyalty_unique$Total <- rowSums(cc_loyalty_unique[,c("0","1","2","3","4","5","11")]) #Row total across all observed distances
cc_loyalty_unique$Sum01 <- rowSums(cc_loyalty_unique[,c("0","1")]) #Sum of the distance-0 and distance-1 columns only
cc_loyalty_unique$MatchPctTotal <- percent(cc_loyalty_unique[,3] / cc_loyalty_unique$Total) #Calc % of perfect (0) matches against Total
cc_loyalty_unique$MatchPct01 <- percent(cc_loyalty_unique$Sum01 / cc_loyalty_unique$Total) #Calc % of perfect (0) and near-perfect (1) matches against Total

cc_loyalty_unique_80 <- cc_loyalty_unique %>% 
  filter(Sum01 / Total >= 0.8) #Compare numerically; comparing formatted percent strings is unreliable
n_distinct(cc_loyalty_unique_80$last4ccnum)
n_distinct(cc_loyalty_unique_80$loyaltynum)

write_csv(cc_loyalty_unique_80,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_loyalty_unique_80.csv")
Show code
dist.concat2 <- adist(cc$concat2,spots$concat, partial = TRUE, ignore.case = TRUE) #Create a matrix with the Standard Levenshtein distance between both newly created concat columns
min.concat2 <- apply(dist.concat2, 1, min) #Extract pairs with minimum distance

match.s3.s4 <- NULL
for (j in 1:nrow(dist.concat2))
{
  s4.j <- match(min.concat2[j], dist.concat2[j,])
  s3.j <- j
  match.s3.s4 <- rbind(data.frame(spots.j=s4.j,
                                    cc.j=s3.j,
                                    spots_concat=spots[s4.j,]$concat,
                                    cc_concat2=cc[s3.j,]$concat2,
                                    adist=min.concat2[j]),match.s3.s4)
  }
    
cc_spots <- match.s3.s4 %>%
  left_join(dplyr::select(cc, last4ccnum, ID), by = c("cc.j" = "ID")) %>% #Add in CC num column
  left_join(dplyr::select(spots, RoleNName, No), by = c("spots.j" = "No")) #Add in Spots num column
Show code
cc_spots_unique <- dcast(cc_spots, last4ccnum + RoleNName ~ adist) #Long to wide by transposing adist
cc_spots_unique$Total <- rowSums(cc_spots_unique[,c("0","1","2","3","7","8","9","10","11","12","13","14","15","16","17","18","19","21")]) #Sum all rows
cc_spots_unique$Sum01 <- rowSums(cc_spots_unique[,c("0","1")]) #Sum only column 1 and 2
cc_spots_unique$MatchPctTotal <- percent(cc_spots_unique[,3] / cc_spots_unique$Total) #Calc % of perfect (0) matches against Total
cc_spots_unique$MatchPct01 <- percent(cc_spots_unique$Sum01 / cc_spots_unique$Total) #Calc % of perfect (0) and near-perfect (1) matches against Total

cc_spots_unique_80 <- cc_spots_unique %>% 
  filter(Sum01 / Total >= 0.8) #Compare numerically; comparing formatted percent strings is unreliable
n_distinct(cc_spots_unique_80$last4ccnum)
n_distinct(cc_spots_unique_80$RoleNName)

write_csv(cc_spots_unique_80,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_spots_unique_80.csv")

Using fuzzyjoin (use Dcast median values)

Show code
spots_cc <- stringdist_join(cc,spots,
                       by = "concat_cc_spots",
                       mode = "left",
                       method = "jw",
                       ignore_case = TRUE,
                       max_dist = 0.2,
                       distance_col = "dist")

spots_cc_summary <- spots_cc %>%
  group_by(concat_cc_spots.x) %>%
  slice_min (order_by = dist, n = 1) %>%
  arrange (desc(dist)) %>%
  filter(dist < 0.1)

write_csv(spots_cc, "C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_cc.csv")

write_csv(spots_cc_summary, "C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_cc_summary.csv")

With this, we then used the spots data above to find the coordinates nearest to these locations. We recorded the coordinates of each location in an Excel file and now bring it into R.

Show code
spot_list <- read_excel("data/spots and list.xlsx", sheet = "LIST") #Import Excel file
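The manual nearest-coordinate lookup could also be scripted. A sketch with made-up reference points and a made-up query (squared lat/long differences are sufficient for ranking at this scale):

```r
# Toy reference locations with coordinates
refs <- data.frame(
  name = c("A", "B", "C"),
  lat  = c(36.050, 36.070, 36.090),
  long = c(24.860, 24.880, 24.900)
)

# Query point to match against the references
query <- c(lat = 36.071, long = 24.879)

# Squared planar distance preserves the ranking of true distances at city scale
d2 <- (refs$lat - query["lat"])^2 + (refs$long - query["long"])^2
nearest <- refs$name[which.min(d2)]
nearest # "B"
```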

Change to GPS lines

Show code
#gps <- readr::read_csv("data/gps.csv") # Add gps data
#gps$Timestamp <- date_time_parse(gps$Timestamp,
#                zone = "",
#                format = "%m/%d/%Y %H:%M:%S")



  
#spots_sf <- st_as_sf(spots, coords = c("long", "lat"), # Changing into a shapefile
#    crs = 4326, agr = "constant")



#spots_l <- spots[,4:5] # Extract only lat long
#spots_ll <- spots_l %>% slice(rep(1:n(), each = 3648))
#spots_ll$No <- rep(1:3648, times = 3648)
#spots_l$Noo <- rep(1:3648, times = 1)
#spots_ll <- left_join(spots_ll,spots_l, by = c("No" = "Noo"))
#spots_ll$Diff <- sqrt((spots_ll$lat.x-spots_ll$lat.y)^2+(spots_ll$long.x-spots_ll$long.y)^2)

#n_perc(spots_ll$Diff < 0.05)

#write_csv(spots,"C:\\Users\\syeda\\OneDrive\\Desktop\\spots.csv")

# Mapping map and gps together specifically for CarID #1
#mapping_spots <- ggplot(spots, aes(long, lat)) +
#  annotation_custom(rasterGrob(mc2,
#    width = unit(1, "npc"),
#    height = unit(1,"npc")),
#  xmin = 24.8244, xmax = 24.9096, ymin = 36.0453, ymax = 36.0952) + 
#  geom_point(size = 0.1) + 
#  coord_fixed(xlim = c(24.8244, 24.9096), ylim = c(36.0453, 36.0952)) + # Fixing the scales regardless of filtering of points
#  theme_bw() + theme(panel.border = element_blank(), # Remove background and grids and reformat scales and axis
#                     panel.grid.major = element_blank(), 
#                     panel.grid.minor = element_blank(), 
#                     axis.line = element_line(colour = "black"))  

#mapping_spots
Show code
loc1 <- "Katerina's Cafe" # location under inspection; reused in the plot title below
cc_calendar_one <- cc %>%
  filter(location == loc1) %>%
  count(date,hour)
  
cc_calendar_ggplot_one <-  ggplot(complete(cc_calendar_one, date, hour), 
    aes(x = date, y = hour)) + 
  geom_tile(aes(fill = n), color = "white", size = 0.1) +
  scale_fill_gradient(low = "light grey", high = "black", na.value = "light grey") +
  scale_x_date(date_labels = "%a \n %d %b", 
               date_breaks = "1 day") +
  scale_y_reverse() +
  labs(title = paste(loc1,"Visit Frequency By Date And Hour"),
       fill = "Frequency \n Of Visit") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5),
        panel.border = element_blank(), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

cc_calendar_ggplot_one